{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to DataFrames\n",
    "**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 23, 2018**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "using DataFrames # load package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and save DataFrames\n",
    "We do not cover all features of the packages. Please refer to their documentation to learn them.\n",
    "\n",
    "Here we'll load `CSV` to read and write CSV files and `JLD`, which allows us to work with a Julia native binary format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "using CSV\n",
    "using JLD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a simple `DataFrame` for testing purposes,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th></tr></thead><tbody><tr><th>1</th><td>true</td><td>1</td><td>missing</td><td>'a'</td></tr><tr><th>2</th><td>false</td><td>2</td><td>b</td><td>missing</td></tr><tr><th>3</th><td>true</td><td>missing</td><td>c</td><td>'c'</td></tr></tbody></table>"
      ],
      "text/plain": [
       "3×4 DataFrames.DataFrame\n",
       "│ Row │ A     │ B       │ C       │ D       │\n",
       "├─────┼───────┼─────────┼─────────┼─────────┤\n",
       "│ 1   │ true  │ 1       │ \u001b[90mmissing\u001b[39m │ 'a'     │\n",
       "│ 2   │ false │ 2       │ b       │ \u001b[90mmissing\u001b[39m │\n",
       "│ 3   │ true  │ \u001b[90mmissing\u001b[39m │ c       │ 'c'     │"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = DataFrame(A=[true, false, true], B=[1, 2, missing],\n",
    "              C=[missing, \"b\", \"c\"], D=['a', missing, 'c'])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and use `eltypes` to look at the columnwise types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Type,1}:\n",
       " Bool                           \n",
       " Union{Int64, Missings.Missing} \n",
       " Union{Missings.Missing, String}\n",
       " Union{Char, Missings.Missing}  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eltypes(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `CSV` to save `x` to disk; make sure `x.csv` does not conflict with some file in your working directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CSV.Sink{Void,DataType}(    CSV.Options:\n",
       "        delim: ','\n",
       "        quotechar: '\"'\n",
       "        escapechar: '\\\\'\n",
       "        missingstring: \"\"\n",
       "        dateformat: nothing\n",
       "        decimal: '.'\n",
       "        truestring: 'true'\n",
       "        falsestring: 'false', IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1), \"x.csv\", 8, true, String[\"A\", \"B\", \"C\", \"D\"], 4, false, Val{false})"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "CSV.write(\"x.csv\", x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can see how it was saved by reading `x.csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A,B,C,D\n",
      "true,1,,a\n",
      "false,2,b,\n",
      "true,,c,c\n"
     ]
    }
   ],
   "source": [
    "print(read(\"x.csv\", String))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also load it back. `use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th></tr></thead><tbody><tr><th>1</th><td>true</td><td>1</td><td>missing</td><td>a</td></tr><tr><th>2</th><td>false</td><td>2</td><td>b</td><td>missing</td></tr><tr><th>3</th><td>true</td><td>missing</td><td>c</td><td>c</td></tr></tbody></table>"
      ],
      "text/plain": [
       "3×4 DataFrames.DataFrame\n",
       "│ Row │ A     │ B       │ C       │ D       │\n",
       "├─────┼───────┼─────────┼─────────┼─────────┤\n",
       "│ 1   │ true  │ 1       │ \u001b[90mmissing\u001b[39m │ a       │\n",
       "│ 2   │ false │ 2       │ b       │ \u001b[90mmissing\u001b[39m │\n",
       "│ 3   │ true  │ \u001b[90mmissing\u001b[39m │ c       │ c       │"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = CSV.read(\"x.csv\", use_mmap=false)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When loading in a `DataFrame` from a `CSV`, all columns allow `Missing` by default. Note that the column types have changed!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Type,1}:\n",
       " Union{Bool, Missings.Missing}  \n",
       " Union{Int64, Missings.Missing} \n",
       " Union{Missings.Missing, String}\n",
       " Union{Missings.Missing, String}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eltypes(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's save `x` to a file in a binary format; make sure that `x.jld` does not exist in your working directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "save(\"x.jld\", \"x\", x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After loading in `x.jld` as `y`, `y` is identical to `x`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th></tr></thead><tbody><tr><th>1</th><td>true</td><td>1</td><td>missing</td><td>'a'</td></tr><tr><th>2</th><td>false</td><td>2</td><td>b</td><td>missing</td></tr><tr><th>3</th><td>true</td><td>missing</td><td>c</td><td>'c'</td></tr></tbody></table>"
      ],
      "text/plain": [
       "3×4 DataFrames.DataFrame\n",
       "│ Row │ A     │ B       │ C       │ D       │\n",
       "├─────┼───────┼─────────┼─────────┼─────────┤\n",
       "│ 1   │ true  │ 1       │ \u001b[90mmissing\u001b[39m │ 'a'     │\n",
       "│ 2   │ false │ 2       │ b       │ \u001b[90mmissing\u001b[39m │\n",
       "│ 3   │ true  │ \u001b[90mmissing\u001b[39m │ c       │ 'c'     │"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = load(\"x.jld\", \"x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the column types of `y` are the same as those of `x`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Type,1}:\n",
       " Bool                           \n",
       " Union{Int64, Missings.Missing} \n",
       " Union{Missings.Missing, String}\n",
       " Union{Char, Missings.Missing}  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eltypes(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we'll create the files `bigdf.csv` and `bigdf.jld`, so be careful that you don't already have these files on disc!\n",
    "\n",
    "In particular, we'll time how long it takes us to write a `DataFrame` with 10^3 rows and 10^5 columns to `.csv` and `.jld` files.  *You can expect JLD to be faster!* Use `compress=true` to reduce file sizes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.782157 seconds (688.90 k allocations: 30.828 MiB, 1.08% gc time)\n",
      "  0.018250 seconds (203.61 k allocations: 3.339 MiB)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "2-element Array{Int64,1}:\n",
       " 595307\n",
       " 154487"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bigdf = DataFrame(Bool, 10^3, 10^2)\n",
    "@time CSV.write(\"bigdf.csv\", bigdf)\n",
    "@time save(\"bigdf.jld\", \"bigdf\", bigdf)\n",
    "getfield.(stat.([\"bigdf.csv\", \"bigdf.jld\"]), :size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "foreach(rm, [\"x.csv\", \"x.jld\", \"bigdf.csv\", \"bigdf.jld\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 0.6.0",
   "language": "julia",
   "name": "julia-0.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
